GWU Data Science Datathon

DC Crash Data Analysis

Tanaya Kavathekar

Abstract

Road accidents are a critical problem: every year more than 1.2 million people die on roads across the globe. There is a pressing need to use the available data to understand the underlying causes of the problem. Road safety issues are complex, and policies differ significantly within and across countries. In this analysis, data from the Metropolitan Police Department's (MPD) crash data management system (COBALT) is studied to find relationships between fatality and the independent features. The crash data covers the District of Columbia.

Each year, more than 1.2 million people die across the globe due to road crashes, so there is a pressing need to understand the underlying causes of the problem. Road safety issues are complex and multi-sectorial, involving the public, stakeholders, and policy makers. Significant differences exist both across and within countries, and policies and interventions therefore need to be adapted to the local environment. Effective interventions require a multi-disciplinary approach that includes enforcement, engineering, psychological, and education measures. Because resources are limited, road safety interventions must address not only the sustainability of the outcomes but also the cost-effectiveness of implementing and maintaining them. More importantly, interventions must be evidence-based and evaluated over time before they are translated into policy. Hence, research cannot be done in silos if the complexity of road safety issues is to be addressed. For sustainability, road safety interventions need to be guided and governed by policy in their development and implementation.

Pipeline

Data Preprocessing

In [20]:
data.describe()
Out[20]:
id crime id person id age
count 5.963810e+05 5.963810e+05 5.963810e+05 426744.000000
mean 4.384924e+08 2.672116e+07 8.506922e+07 38.668302
std 1.721813e+05 1.238390e+06 8.613766e+06 20.897059
min 4.370014e+08 2.341134e+07 1.045383e+07 -7990.000000
25% 4.383433e+08 2.532167e+07 8.474899e+07 27.000000
50% 4.384924e+08 2.680585e+07 8.497752e+07 37.000000
75% 4.386415e+08 2.769386e+07 8.712287e+07 51.000000
max 4.387906e+08 2.872803e+07 9.077153e+07 237.000000

Outlier treatment

The age variable has many outliers (the summary above shows a minimum of -7990 and a maximum of 237). Based on the box plot, values below 1 and above 85 are treated as outliers and replaced with NaN.

In [21]:
data['age'] = np.where(data['age']<1, np.nan, data['age'])
fig = px.histogram(data, x="age", marginal="box")
fig.show()
In [50]:
data['age'] = np.where(data['age']>85, np.nan, data['age'])

Missings values detection and treatment

Only the age column shows missing data, about 30% of its values; these are imputed with 0 before modeling.
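The detection step can be sketched as below; this is a minimal illustration on a toy frame standing in for `data`, not the notebook's actual cell.

```python
import numpy as np
import pandas as pd

# Toy frame standing in for the crash data ('data' in the notebook):
# 3 of 10 ages missing, i.e. 30%
df = pd.DataFrame({"age": [25, np.nan, 40, np.nan, 60, 33, np.nan, 51, 47, 29]})

# Percentage of missing values per column
missing_pct = df.isna().mean() * 100
print(missing_pct)

# Impute the missing ages with 0, as described above
df["age"] = df["age"].fillna(0)
```

The same two lines applied to the full crash data reproduce the 30% figure quoted above.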

Column Creation

  • Mask the boolean fields fatal, impaired, speeding, and ticket issued to 1s and 0s
  • Fatality rate: at the accident level, the number of people who suffered fatal injuries divided by the total number of people involved in the accident
  • Severity level: combine the minor, major, and fatal injury columns into a single column with indicator labels
  • Map US license states into the 4 regions designated by the US census
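The steps above can be sketched as follows. This is an illustrative reconstruction on a toy frame, not the notebook's cell: the percentage scaling of the fatality rate and the partial region mapping are assumptions.

```python
import pandas as pd

# Toy frame: two accidents (crime ids 1 and 2) with five people total
df = pd.DataFrame({
    "crime id": [1, 1, 1, 2, 2],
    "fatal": ["Y", "N", "N", "N", "N"],
    "License state": ["DC", "VA", "OH", "CA", "NY"],
})

# 1. Mask the Y/N flag to 1s and 0s
df["fatal"] = (df["fatal"] == "Y").astype(int)

# 2. Fatality rate per accident: fatal people / total people involved
#    (expressed as a percentage here -- an assumption about the scale)
df["fatality rate"] = df.groupby("crime id")["fatal"].transform("mean") * 100

# 3. Map license states to the four US census regions
#    (partial mapping, for illustration only)
region_map = {"DC": "South", "VA": "South", "OH": "Midwest",
              "CA": "West", "NY": "Northeast"}
df["Region"] = df["License state"].map(region_map)
```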

EDA

In this analysis, fatal is the target variable. As shown below, the dataset is heavily imbalanced, which is treated further on. Only 0.07% (~417) of the records are fatal.

In [28]:
print("Percentage of non fatal class {} and fatal class {} ".format(
    round((data.groupby(['fatal']).agg({'fatal':'count'})['fatal'][0]/ data.shape[0])*100, 2), 
    round((data.groupby(['fatal']).agg({'fatal':'count'})['fatal'][1]/ data.shape[0])*100, 2)))

print("Percentage of non major class {} and major class {} ".format(
    round((data.groupby(['major injury']).agg({'major injury':'count'})['major injury'][0]/ data.shape[0])*100, 2), 
    round((data.groupby(['major injury']).agg({'major injury':'count'})['major injury'][1]/ data.shape[0])*100, 2)))

print("Percentage of non minor class {} and minor class {} ".format(
    round((data.groupby(['minor injury']).agg({'minor injury':'count'})['minor injury'][0]/ data.shape[0])*100, 2), 
    round((data.groupby(['minor injury']).agg({'minor injury':'count'})['minor injury'][1]/ data.shape[0])*100, 2)))
Percentage of non fatal class 99.93 and fatal class 0.07 
Percentage of non major class 96.42 and major class 3.58 
Percentage of non minor class 88.93 and minor class 11.07 
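The three nearly identical groupby expressions above can be written more compactly with `value_counts(normalize=True)`. A minimal sketch on a toy column that mimics the observed fatal/non-fatal imbalance:

```python
import pandas as pd

# Toy column mimicking the ~0.07% fatal imbalance (7 fatal in 10,000 rows)
df = pd.DataFrame({"fatal": [0] * 9993 + [1] * 7})

# value_counts(normalize=True) gives class proportions directly
pct = (df["fatal"].value_counts(normalize=True) * 100).round(2)
print("Percentage of non fatal class {} and fatal class {}".format(pct[0], pct[1]))
```

The same one-liner applied to `major injury` and `minor injury` reproduces the other two printouts.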

How many people were involved per accident, and how severe were their injuries?

In [52]:
crime = pd.DataFrame(data[['crime id', 'severity']].value_counts())
crime = crime.reset_index()
crime.columns = ['crime id', 'severity', 'count of people']
fig = px.histogram(crime, x="count of people", color='severity', marginal="box")
fig.show()

Which vehicle type has the most accidents, and the most fatal accidents?

In [30]:
vhCount = pd.DataFrame(data[['vehicle type', 'severity']].value_counts())
vhCount.reset_index(inplace=True) 
vhCount.columns = ['vehicle type', 'severity','count']

fig = px.bar(vhCount, x='vehicle type', y='count', color='severity',
             title="Number of accidents by vehicle type",
             template="simple_white")
# fig.layout.update(showlegend=False) 

fig.show()

From which states do people most often travel to DC, and how many travel within DC?

In [31]:
vhCount = pd.DataFrame(data[['License state']].value_counts())
vhCount
vhCount.reset_index(inplace=True) 
vhCount.columns = ['License state', 'count']

fig = go.Figure(data=go.Choropleth(
    locations=vhCount['License state'], # Spatial coordinates
    z = vhCount['count'], # Data to be color-coded
    locationmode = 'USA-states', # set of locations match entries in `locations`
    colorscale = 'Reds',
    colorbar_title = "Accidents",
))

fig.update_layout(
    title_text = 'Accidents by State',
    geo_scope='usa', # limit map scope to USA
)

fig.show()

Does the fatality rate of accidents change for different state licenses?

People holding licenses from other states show a higher fatality rate, possibly because they are less familiar with the local terrain.

In [32]:
stateFatality = data[data['fatality rate']>0.0000]
stateFatalityAvg = stateFatality.groupby(['License state'], as_index=False).agg({'fatality rate':'mean'})
stateFatalityAvg

fig = go.Figure(data=go.Choropleth(
    locations=stateFatalityAvg['License state'], # Spatial coordinates
    z = stateFatalityAvg['fatality rate'], # Data to be color-coded
    locationmode = 'USA-states', # set of locations match entries in `locations`
    colorscale = 'Reds',
    colorbar_title = "Accidents",
))

fig.update_layout(
    title_text = 'Average fatality rate by State',
    geo_scope='usa', # limit map scope to USA
)

fig.show()

Does age affect the fatality rate of an accident?

In [33]:
# fatality rate and age
fig = px.scatter(data, x="age", y="fatality rate", color='severity')
fig.show()

Which person type met with the most accidents, and which suffered the most injuries?

In [49]:
ptCount = pd.DataFrame(data[['person type', 'severity']].value_counts())
ptCount.reset_index(inplace=True) 
ptCount.columns = ['person type', 'severity','count']


fig = px.bar(ptCount, x='person type', y='count', color='severity',
             title="Number of people met with an accident by person type",
             template="simple_white", 
             category_orders={ # replaces default order by column name
                 "person type": ["Driver", "Passenger", "Pedestrian", "Bicyclist"]
            },)

percentage = pd.crosstab(data['person type'], data['severity']).apply(lambda r: (r/r.sum())*100, axis=1)

# Annotate each bar group with its severity breakdown
for ptype in ["Driver", "Passenger", "Pedestrian", "Bicyclist"]:
    fig.add_annotation(text="Accidents % <br> Fatal: {}, Major: {}, <br> Minor: {}".format(
        round(percentage['fatal'][ptype], 2),
        round(percentage['major'][ptype], 2),
        round(percentage['minor'][ptype], 2)
        ), x=ptype, y=70000, showarrow=False)

# fig.layout.update(showlegend=False) 
fig.show()
  1. The data contains 4 types of persons: driver, passenger, pedestrian, and bicyclist
  2. Drivers contribute 77% of the entries, followed by 19% passengers, 2% pedestrians, and 0.6% bicyclists
  3. Although drivers have the most entries, only 0.05% of their accidents are fatal, while 2.63% are major and 8.04% are minor
  4. Passengers suffered the most minor injuries in accidents
  5. Pedestrians are not frequently involved in accidents, but when they are, 53% suffer minor injuries, 24% major, and 0.7% fatal injuries; this is the highest fatality share of any person type
  6. Similarly, bicyclists have the fewest accident entries, but when involved they suffered 60.82% minor injuries, 8% major injuries, and 0.22% fatal injuries

Analysis according to person type

Driver

In [35]:
driver = data[data['person type'] == 'Driver']
In [36]:
driver = driver[driver['age'] >1]
fig = px.histogram(driver, x="age", color='severity', marginal="box")
fig.show()
Does vehicle type affect the severity of accidents?
In [37]:
vhCount = pd.DataFrame(driver[['vehicle type', 'severity']].value_counts())
vhCount.reset_index(inplace=True) 
vhCount.columns = ['vehicle type', 'severity','count']

fig = px.bar(vhCount, x='vehicle type', y='count', color='severity',
             title="Number of accidents by vehicle type",
             template="simple_white")
# fig.layout.update(showlegend=False) 

fig.show()

Passenger

In [38]:
passenger = data[data['person type'] == 'Passenger']
In [39]:
passenger = passenger[passenger['age'] >1]
fig = px.histogram(passenger, x="age", color='severity', marginal="box")
fig.show()
In [40]:
vhCount = pd.DataFrame(passenger[['vehicle type', 'severity']].value_counts())
vhCount.reset_index(inplace=True) 
vhCount.columns = ['vehicle type', 'severity','count']

fig = px.bar(vhCount, x='vehicle type', y='count', color='severity',
             title="Number of accidents by vehicle type",
             template="simple_white")
# fig.layout.update(showlegend=False) 

fig.show()

Pedestrian

In [41]:
pedestrian = data[data['person type'] == 'Pedestrian']
In [42]:
pedestrian = pedestrian[pedestrian['age'] >1]
fig = px.histogram(pedestrian, x="age", color='severity', marginal="box")
fig.show()

Model Selection

Random Uniform Under Sampling
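The notebook does not show the sampling cell itself; below is a minimal sketch of uniform random under-sampling with pandas, assuming the balanced frame is named `a` as in the cells that follow (with ~417 fatal records, this yields the 834 observations reported by the regression below).

```python
import pandas as pd

# Toy imbalanced frame standing in for the crash data
df = pd.DataFrame({"fatal": [0] * 500 + [1] * 20, "age": range(520)})

# Keep every minority (fatal) record and randomly down-sample the
# majority class to the same size, then shuffle
minority = df[df["fatal"] == 1]
majority = df[df["fatal"] == 0].sample(n=len(minority), random_state=42)
a = pd.concat([minority, majority]).sample(frac=1, random_state=42)
```

Under-sampling discards majority-class information, but with ~596k rows and only ~417 fatal records it keeps the balanced frame small enough to model quickly.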

Relationship Strengths

In [53]:
dython.nominal.associations(a[['severity', 'fatal', 'fatality rate','vehicle type', 'ticket issued','Region', 'impaired','age', 'speeding', 'person type']], theil_u=True)
Out[53]:
{'corr':                severity     fatal  fatality rate  vehicle type  ticket issued  \
 severity       1.000000  1.000000       0.796636      0.075096       0.066784   
 fatal          1.000000  1.000000       0.796635      0.398397      -0.004870   
 fatality rate  0.796636  0.796635       1.000000      0.363773       0.068440   
 vehicle type   0.122996  0.398397       0.363773      1.000000       0.225724   
 ticket issued  0.066784 -0.004870       0.068440      0.225724       1.000000   
 Region         0.036209  0.233933       0.176967      0.199513       0.106797   
 impaired       0.084922  0.084922       0.069374      0.126110       0.020913   
 age            0.149978  0.112146       0.184820      0.154240       0.121119   
 speeding       0.208334  0.207661       0.262471      0.366904       0.170605   
 person type    0.087195  0.304221       0.265939      0.267597       0.240737   
 
                  Region  impaired       age  speeding  person type  
 severity       0.048642  0.084922  0.149978  0.208334     0.090719  
 fatal          0.233933  0.084922  0.112146  0.207661     0.304221  
 fatality rate  0.176967  0.069374  0.184820  0.262471     0.265939  
 vehicle type   0.438978  0.126110  0.154240  0.366904     0.455998  
 ticket issued  0.106797  0.020913  0.121119  0.170605     0.240737  
 Region         1.000000  0.084674  0.125845  0.086361     0.293907  
 impaired       0.084674  1.000000  0.008789 -0.021951     0.131849  
 age            0.125845  0.008789  1.000000  0.040347     0.482570  
 speeding       0.086361 -0.021951  0.040347  1.000000     0.196817  
 person type    0.379490  0.131849  0.482570  0.196817     1.000000  ,
 'ax': <matplotlib.axes._subplots.AxesSubplot at 0x1ef425cab08>}

Multiple Linear Regression

In [46]:
X = a[['vehicle type', 'age', 'Region', 'speeding', 'person type']]
y = a[['fatality rate']]
X = pd.get_dummies(data = X)
X['age'] = X['age'].fillna(0)

import statsmodels.api as sm
X = sm.add_constant(X) ## let's add an intercept (beta_0) to our model

# Note the difference in argument order
model = sm.OLS(y, X).fit() ## sm.OLS(output, input)
predictions = model.predict(X)

# Print out the statistics
model.summary()
Out[46]:
OLS Regression Results
Dep. Variable: fatality rate R-squared: 0.213
Model: OLS Adj. R-squared: 0.189
Method: Least Squares F-statistic: 9.103
Date: Sun, 04 Apr 2021 Prob (F-statistic): 6.27e-29
Time: 19:57:53 Log-Likelihood: -3873.3
No. Observations: 834 AIC: 7797.
Df Residuals: 809 BIC: 7915.
Df Model: 24
Covariance Type: nonrobust
coef std err t P>|t| [0.025 0.975]
const 15.8479 4.229 3.747 0.000 7.547 24.149
age 0.2195 0.043 5.069 0.000 0.134 0.304
speeding 25.1665 3.985 6.316 0.000 17.345 32.988
vehicle type_Atv (all Terrain Vehicle) 30.7411 24.190 1.271 0.204 -16.741 78.223
vehicle type_Bus -19.8331 8.981 -2.208 0.028 -37.462 -2.204
vehicle type_Cargo Van 5.4682 17.243 0.317 0.751 -28.378 39.314
vehicle type_Drugs/ Narcotics 8.4546 8.991 0.940 0.347 -9.193 26.102
vehicle type_Firearms -8.7411 5.244 -1.667 0.096 -19.035 1.553
vehicle type_Large/heavy Truck -6.9540 9.640 -0.721 0.471 -25.876 11.968
vehicle type_Moped/scooter 13.5547 10.419 1.301 0.194 -6.896 34.006
vehicle type_Motor Cycle 23.6071 5.360 4.404 0.000 13.085 34.129
vehicle type_Motorhome/camper/rv (recreational Vehicle) -18.7633 17.241 -1.088 0.277 -52.605 15.079
vehicle type_None 7.0710 11.085 0.638 0.524 -14.687 28.829
vehicle type_Other Small/light Truck 3.1306 17.321 0.181 0.857 -30.868 37.129
vehicle type_Other Vehicle -3.7756 3.952 -0.955 0.340 -11.533 3.982
vehicle type_Passenger Car/automobile -2.4450 3.201 -0.764 0.445 -8.729 3.839
vehicle type_Passenger Van 0.1005 5.688 0.018 0.986 -11.065 11.266
vehicle type_Pickup Truck -9.4119 7.273 -1.294 0.196 -23.687 4.863
vehicle type_Suv (sport Utility Vehicle) -6.3560 4.726 -1.345 0.179 -15.632 2.920
Region_Midwest -6.0421 8.863 -0.682 0.496 -23.440 11.356
Region_Northeast -9.2636 6.807 -1.361 0.174 -22.625 4.098
Region_South -0.7646 3.498 -0.219 0.827 -7.631 6.101
Region_West -11.6443 15.190 -0.767 0.444 -41.462 18.173
person type_Bicyclist 8.2567 6.855 1.205 0.229 -5.199 21.712
person type_Driver -0.8723 4.527 -0.193 0.847 -9.758 8.013
person type_Passenger -1.0911 4.863 -0.224 0.823 -10.637 8.455
person type_Pedestrian 9.5546 8.172 1.169 0.243 -6.487 25.596
Omnibus: 198.884 Durbin-Watson: 1.006
Prob(Omnibus): 0.000 Jarque-Bera (JB): 406.420
Skew: 1.344 Prob(JB): 5.59e-89
Kurtosis: 5.115 Cond. No. 5.06e+16


Warnings:
[1] Standard Errors assume that the covariance matrix of the errors is correctly specified.
[2] The smallest eigenvalue is 4.32e-28. This might indicate that there are
strong multicollinearity problems or that the design matrix is singular.

Random Forest

In [48]:
getRF(a)
F1 Score: 0.614
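The helper `getRF` is not defined in the notebook; the sketch below is a hypothetical reconstruction of what such a helper might look like, using scikit-learn with a held-out split and the F1 metric reported above. The feature handling (one-hot encoding via `get_dummies`) mirrors the regression cell; the split ratio and hyperparameters are assumptions.

```python
import pandas as pd
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import f1_score
from sklearn.model_selection import train_test_split

def getRF(frame):
    """Fit a random forest on the balanced frame and return the F1 score.

    Hypothetical reconstruction: the notebook does not show getRF's body.
    """
    X = pd.get_dummies(frame.drop(columns=["fatal"]))
    y = frame["fatal"]
    X_train, X_test, y_train, y_test = train_test_split(
        X, y, test_size=0.3, random_state=42, stratify=y)
    clf = RandomForestClassifier(n_estimators=100, random_state=42)
    clf.fit(X_train, y_train)
    return f1_score(y_test, clf.predict(X_test))
```

On the balanced frame `a`, the notebook reports an F1 score of 0.614.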

Discussion